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Abstract. This paper examines experimental design procedures used to develop surrogates of computational 
models, exploring the interplay between experimental designs and approximation algorithms. We 
focus on two widely used approximation approaches, Gaussian process (GP) regression and non- 
intrusive polynomial approximation. First, we introduce algorithms for minimizing a posterior 
integrated variance (IVAR) design criterion for GP regression. Our formulation treats design as 
a continuous optimization problem that can be solved with gradient-based methods on complex 
input domains, without resorting to greedy approximations. We show that minimizing IVAR in 
this way yields point sets with good interpolation properties, and that it enables more accurate 
GP regression than designs based on entropy minimization or mutual information maximization. 
Second, using a Mercer kernel/eigenfunction perspective on GP regression, we identify conditions 
under which GP regression coincides with pseudospectral polynomial approximation. Departures 
from these conditions can be understood as changes either to the kernel or to the experimental 
design itself. We then show how IVAR-optimal designs, while sacrificing discrete orthogonality 
of the kernel eigenfunctions, can yield lower approximation error than orthogonalizing point 
sets. Finally, we compare the performance of adaptive Gaussian process regression and adaptive 
pseudospectral approximation for several classes of target functions, identifying features that are 
favorable to the GP + IVAR approach. 

Key words. Gaussian process regression, experimental design, computer experiments, approximation theory, 
polynomial approximation, kernel interpolation, uncertainty quantification 


1. Introduction. Computational simulations are essential for design, optimization, un¬ 
certainty quantification, and inference in complex systems. Yet these tasks typically require 
a large number of simulations over a range of parameter values, which can be computation¬ 
ally prohibitive. One method of mitigating this computational expense is to construct 
surrogates or “emulators” that replace the simulation in the relevant analyses. Because 
many computational simulations are available only as black-box or legacy codes, surrogates 
often must be constructed through a limited number of simulations at particular parameter 
values {xi}. These simulations are sometimes called “computer experiments” [52, 53] and 
choosing these parameter values is a question of experimental design. 

Surrogate construction can be viewed as a function approximation problem in which 
one attempts to approximate an input-output relationship f(x ) induced by the expensive 
simulation. One can do so deterministically, i.e., obtaining a single approximation /, or 
probabilistically, i.e., obtaining a distribution over possible functions / ~ J 7 , where J 7 is a 
probability distribution on a suitable function space. In either case, three decisions must 
be made. First, one must choose an approximation space that contains candidate surrogate 
functions; second, one must select a set of parameter values or experiments at which to 
simulate the system; and finally, one must choose an algorithm to convert the simulation 
results into a particular function f or distribution J 7 . Examples of approximation spaces 
include those spanned by polynomials of a certain degree, or by radial basis functions and 
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other kernels. Possible experiments include designs produced by Monte Carlo sampling, 
obtained via optimization of information-based design criteria such as entropy and mutual 
information, or based on numerical quadrature rules. Finally, examples of algorithms in¬ 
clude linear regression and pseudospectral approximation. These three decisions are usually 
not independent; for example, pseudospectral approximation requires experimental design 
procedures that preserve orthogonality of the relevant basis functions. 

In this paper, we analyze two of the most commonly used surrogate construction ap¬ 
proaches in uncertainty quantification: Gaussian process regression (GPR), which has seen 
much development in the statistics community, and pseudospectral approximation (PSA), 
which is widely used in applied mathematics and engineering. Both of these methods are 
routinely applied for similar purposes—replacing computationally expensive models with 
cheap-to-evaluate surrogates for complex analysis tasks. One of our primary goals is to 
compare these two approaches by analyzing the approximation spaces, experiments, and al¬ 
gorithms that they employ. The PSA method comes equipped with an experimental design 
procedure and an algorithm intended to produce approximations that are accurate in an L 2 
sense. The GPR methodology is more flexible in that it does not come pre-equipped with 
an experimental design procedure. In order to compare these two methods, we will employ 
an experimental design criterion that also seeks accuracy in an L 2 sense. 

Our main contributions are as follows. First, we develop a continuous optimization algo¬ 
rithm, based on sample-average approximation (SAA), to minimize an integrated posterior 
variance (IVAR) design criterion for Gaussian process regression. We compare our algo¬ 
rithm to approaches that maximize other information-based criteria (e.g., entropy or mutual 
information) by evaluating their computational costs, the properties of the resulting point 
sets, and the accuracy of the resulting approximations. Second, we provide a theoretical 
and numerical analysis comparing non-intrusive polynomial approximation—in particular 
pseudospectral polynomial approximation—with Gaussian process regression. While the 
relative performance of these methods may be a subject of broader debate, here we assess 
the impact of each of the three surrogate construction ingredients described above. Theo¬ 
retically, we develop results to describe the difference between the approximations given the 
same experiments and similar approximation spaces. Numerically, we investigate the perfor¬ 
mance of the two approaches given similar approximation spaces but different experiments, 
chosen to be optimal for each. 

The IVAR objective, which is equivalent to the IMSE criterion [52, 9], can be minimized 
either over a finite (and hence discrete) design space or over a continuous design space. In 
the discrete case, the criterion is sometimes called the ALC (“active learning Cohn”) [6] 
objective. Minimizing the ALC criterion involves sequentially adding experiments, chosen 
one at a time from a discrete and finite set of candidates, to minimize a weighted average 
of the predictive variance. ALC has also been investigated in the context of determining 
local designs for large-scale computer experiments [27]. ALC is often compared to other 
discrete design space criteria, namely ALM (“active learning MacKay”) and mutual infor¬ 
mation (MI) [34]. Minimizing the ALM criterion involves sequentially adding experiments 
at locations where the local predictive variance is maximized. Compared with ALM, the 
ALC criterion considers the effect of each experiment on the entire domain and therefore 
yields better designs; ALC is more expensive, however, because it requires a new variance 
computation for each potential design. The MI criterion sequentially seeks points that max- 
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imize the expected information gain at locations not yet chosen. MI design requires a good 
candidate set, which may be difficult to obtain for input domains with complex geome¬ 
tries, though strides have been made in this direction [3]. It also remains computationally 
expensive, with a complexity that grows cubically with the size of the candidate set. 

Instead of dealing with the combinatorial optimization issues associated with discrete 
design spaces, we will pursue optimization in a continuous space. Our approach is similar 
to [51, 52] in that we use gradient-based methods to search for optimal designs, but we will 
explore opportunities presented by solving the full optimization problem, without ad hoc 
simplifications of the design space. We will employ a sample-average approximation (SAA) 
that proves to be effective for complex domain geometries. Benefits of this approach include 
generating batches of experiments with lower computational complexity than sequentially 
minimizing the ALC criterion; eliminating undesirable boundary clustering effects associ¬ 
ated with radial basis function kernels, which plague ALM designs [48, 34]; and achieving 
better approximation performance than either ALM or MI. Finally, because we perform de¬ 
sign on a continuous space of candidate points, it becomes natural to analyze the stability 
and accuracy of approximation with IVAR-optimal designs from the perspective of numeri¬ 
cal analysis. For example, in Section 4 we will show that our algorithm generates point sets 
with good interpolatory properties, as measured by their Lebesgue constants. Finding these 
point sets via a statistical criterion raises interesting links with previous work in Bayesian 
numerical analysis [44, 15], particularly the average-case quadrature of [40, 46]. A con¬ 
tinuous design procedure also facilitates more cleanly comparing GPR and pseudospectral 
approximation. 

Our comparison of GPR with pseudospectral approximation has two elements. First, 
we use Mercer’s theorem to rewrite GP regression in terms of orthogonal eigenfunctions of 
the kernel, such that when these eigenfunctions contain the finite basis for a pseudospectral 
approximation, one can directly assess the difference between the two approximations. If 
experimental design is based on an orthogonalizing quadrature rule, the difference between 
the GP mean and the pseudospectral approximation is due to eigenfunctions of the GP ker¬ 
nel which are not in this finite basis. These leftover eigenfunctions also account entirely for 
the integrated variance of the GP. Furthermore, we illustrate through numerical examples 
that experiments achieving optimal IVAR for these GPs can differ qualitatively from stan¬ 
dard quadrature rules, and that when the IVAR criterion is then used to select experiments 
for GP regression, GPR can outperform pseudospectral approximation in some settings. 
Second, we consider adaptive procedures for GPR that interleave IVAR-based experimen¬ 
tal design with adaptation of the kernel hyper parameters. For test problems of moderate 
dimension, we find that GP approximations constructed in this way can again outperform 
certain adaptive pseudospectral approaches [7]. 

This paper is organized into two parts. The first part reviews Gaussian process re¬ 
gression (Section 2), introduces the IVAR criterion and its optimization (Section 3), and 
describes numerical comparisons of IVAR with other experimental design procedures for 
GP regression (Section 4). The second part provides a brief background on pseudospec¬ 
tral approximation (Section 5.1), then describes theoretical (Section 5.2) and numerical 
(Section 5.4) comparisons between PSA and GP regression. 

2. Gaussian process regression. Gaussian process (GP) regression can be interpreted 
as a Bayesian method for function approximation [45, 49], providing a posterior probability 
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distribution over functions. The method begins with a Gaussian process prior, specified via 
a prior mean function mo(x) and a covariance kernel K(x,x') that is positive semidefinite 
and bounded. Suppose that N simulations of the function / : —>• M are performed 

at parameter values x := [xi,... xjv], x i £ M d , yielding noisy function evaluations y := 
[yi,---VN], where & = /(x;) + & for i = 1, ...,N and & ~ Af(0, a 2 ). The resulting 
posterior distribution is 

f\x, y ~ A f(m(x), C(x, x')), 

where the posterior mean is 

(2.1) m(x) = mo(x) + a T K(x, x), 
and the posterior covariance is 

(2.2) C(x,x') = K(x, x') — K(x,x) T HK(x,x'). 

In the notation above, K(x , x) is a (column) vector in whose ith component is IF (x*, x). 
The covariance matrix R _1 has elements [R — 1 ] ^ = K (x ? ; . xj ) + 5ij(j 2 . Finally, the ?'th 
element of the vector of coefficients a £ M A is a.i = Rr^.i (y — rrio(x)). 

Gaussian process regression reverts to interpolation when a 2 = 0. However, as N —>• oo 
the covariance matrix R _1 becomes ill-conditioned; a small value for a 2 , called a nugget, is 
often then introduced to stabilize the procedure [43]. Note also that we have not included 
inference of the prior mean mo(x) in the Bayesian formulation above. If the prior mean 
is described via some parametric model mo(x) = {3 T n(x), where n(x) is a vector of basis 
functions, then Bayesian inference of the coefficients j3 would add terms to the posterior 
covariance. In practice, however, it is common and quite effective to assume either a zero 
or non-zero constant term for uiq and to fix its value (for example, by maximizing the log- 
marginal likelihood) before performing the GP update; see, e.g., [30, 36, 27]. For simplicity, 
we fix the prior mean here. Doing so will also help focus the comparison of nonparametric 
GP regression with parametric PSA in Section 5 on its essential aspects. 

2.1. Reproducing kernels and Mercer’s theorem. Many elements of this work rely 
on the interpretation of the covariance kernel through its eigenfunctions, and to this end 
we recall the properties of a Mercer kernel. Let the kernel K : X x X —> R be defined 
on a first-countable [16] space X C M. d endowed with a strictly positive Borel measure y. 
Suppose that the kernel is continuous, positive semi-definite, 

(2.3) [ K(x,x')g(x)g(x')dy(x)dy(x’) > 0, Mg G L 2 (X), 

J\x x 

and in L l (X, y) 

(2.4) f \K(x, x)| dy{x) < oo. 

Additionally we define the integral operator Tk '■ L 2 (X) —>• L^(X) such that Txf = 
/ K(x, x')f{x')dy{x'). This operator has a countable system of eigenvalues A j that are 
non-negative and that satisfy 

OO 

E A i < °°- 

3 = 1 
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The eigenfunctions 0, of Tj< form an orthonormal basis of L^(X). These eigenfunctions and 
eigenvalues can be used to define the reproducing kernel Hilbert space (RKHS) associated 
with the kernel [49]. Mercer’s theorem lets us represent I\ as a convergent series in terms 
of this eigensystem. 

Theorem 2.1 (Mercer). [39, 21, 4] Let X C R be first countable or locally compact, fi a 
strictly positive Borel measure on X, and K a continuous function on X xX satisfying (2.3) 
and (2-4) ■ Then 

OO 

(2.5) I\(x,x') = ^2\i(j>i(x)(l>i(x'), 

i =1 

where the series converges absolutely for each pair (x,x') € X x X and uniformly on each 
compact subset of X. 


When comparing GP regression with pseudospectral approximation in later sections of 
this paper, we will also use the notion of a truncated kernel. These are kernels for which 
Xiy.i = 0, and for which we can equivalently write (2.5) as 

t 

K(x,x') = E A icj)i(x)(j)i(x f ). 

i =1 


One common example of a truncated kernel is a polynomial kernel, e.g., K(x,x') = (x T x' + 
l) p , where p is a positive integer. 

We can use Mercer’s theorem to write the integrated posterior variance of the Gaussian 
process in terms of the eigensystem (AThe posterior variance at any point in the 
domain is c(x) := C(x,x). The integrated variance then becomes 


c{x)dji{x) = 


j {k{ x,x) — I\ (x, x) r R K(x, x)J dfi(x) 

/ „ OO OO 

K(x, x)d/a(x) — /EE A*A^fR0 j4>i(x)4>j(x)diJ,(x) 


i =1 j =1 


( 2 . 6 ) 


= E a *-E a ?^ r ^> 


i =1 2—1 


where fii := [fifixfi ),..., 4>i(xN)] T ■ The first term is the integrated variance of the prior. 
The second term, which is always non-negative, reflects reduction in the integrated prior 
variance due to conditioning on the data. We will now examine how experimental design 
procedures can use this integrated variance as an optimization objective. 

3. Integrated variance experimental design. We will design experiments to minimize 
the integrated posterior variance (IVAR) of the Gaussian process. This choice is motivated 
by inferential considerations. As opposed to design procedures based on Latin hypercube 
sampling, quasi-Monte Carlo, quadrature, or other “lattice” designs, the present design 
strategy directly aims to minimize a measure of the uncertainty associated with the approx¬ 
imation. One advantage of this approach is that it can be used to design experiments on a 
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wide variety of input domains, not just domains with tensor-product or some other canoni¬ 
cal structure. Another advantage of computing and monitoring a measure of uncertainty is 
that it provides useful feedback about the quality of the approximation; for example, if the 
data do not yield much reduction in uncertainty, one can adjust the approximation space 
or some other aspect of the surrogate construction methodology. 

To put the IVAR criterion in context, we note that it is equivalent to an expected 
integrated squared error of the posterior mean. First consider the posterior expectation of 
the squared error in function values, integrated over the parameter space, 


E 


f\n,y 


J (/(®)-/(®)) d/i[x) 


where / can be thought of as a posterior realization of the Gaussian process and y is the 
measure on the parameter space X. We can divide this quantity into two terms, 


(3.1) E 


'f\*,y 


f(x)~f(x)) dy{x) 


= / (m(x\x, y) — f(x)) 2 dy(x) + / c(x\x)dy(x), 


where the first term on the right-hand side is the integrated squared error of the posterior 
mean and the second term is the integrated posterior variance, i.e., the IVAR. We have 
explicitly indicated all the conditioning on the right-hand side of (3.1). Note that the 
second term is independent of the sampled values y of the function /. 1 Computing the first 
term, on the other hand, requires the ability to evaluate /. Directly using this term in a 
design criterion would defeat the purpose of experimental design, which is motivated by the 
desire to evaluate / sparingly. Instead, we can consider the expectation of this squared error 
over the joint distribution of / and y , for a fixed design x. We assume that / is drawn from 
the prior J\f (mo(x), I\(x, x')), and therefore this expectation becomes the Bayes risk of the 
posterior mean under squared error loss, which is equivalent to the integrated mean squared 
error (IMSE) criterion proposed by [52], Some manipulation shows that this Bayes risk is 
indeed equal to the integrated posterior variance, i.e., that after taking the expectation over 
/ and y, the two terms on the right side of (3.1) are the same. For further details, see, e.g., 
[54, p. 92], Thus IVAR minimization can also be understood as /- or F-optimal design [1], 
in the sense of minimizing the /U-weighted variance of the predictions over the design region 
X. 

A different connection to optimal design theory can be made by finding a finite parame¬ 
terization (and hence, in general, an approximation) of the Mercer kernel K and converting 
the problem into one of parametric model fitting and prediction [18, 20, 19, 28]. Such ap¬ 
proaches decompose the kernel by computing a Karhunen-Loeve expansion [19, 28] of the 
Gaussian process and then truncating it; alternatively, one can use a polar spectral approx¬ 
imation [61] of the kernel to avoid computing its eigenfunctions. In either case, once the 
kernel has been decomposed and truncated, one can approximate the posterior mean of the 
GP prediction as a linear combination of finitely many basis functions, 

l 

(3.2) m(x) = m 0 (x) + ^ 9i<l>i(x). 

1=1 


1 In this section, we keep the prior covariance kernel K fixed. In Section 5.4, we will consider closed-loop 

adaptive design strategies that learn hyperparameters of the kernel K from evaluations of the function /. 
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In optimal design theory, one may then seek experiments to best learn about the mean 
structure mo, the parameters 6\,... ,9^, or to minimize some uncertainty in the prediction. 
These choices correspond to different classical “alphabetic optimality” criteria applied to 
the corresponding information matrix, e.g., A, D, G, or /-optimal design. 

The IVAR design procedure presented in this paper, however, does not require a fi¬ 
nite parameterization of the kernel; instead it can use a closed-form expression for the 
posterior covariance, with no truncation. A direct comparison between such kernel-based 
procedures (for example, the Sacks-Ylvisaker approach described in [18]) and parametric 
optimal design theory is outside the scope of this paper; we refer readers to [18, 62, 42, 19] 
instead. We will, however, compare our IVAR optimization procedure to other algorithms 
and design objectives that do not require explicitly finding a finite parameterization of the 
kernel. Later, when comparing GP regression to another surrogate modeling methodology— 
pseudospectral polynomial approximation, in Section 5— we will return to the eigenfunction 
viewpoint of GP regression. In that context, our focus will not be on algorithms that re¬ 
quire explicit access to the eigenfunctions of the kernel, but rather on how the eigenfunction 
viewpoint exposes the distinct modeling assumptions made by the two methodologies. 

3.1. IVAR evaluation and minimization. For a chosen number of experiments N > 1 
and a fixed prior covariance kernel A', our optimal experimental design is a set of evaluation 
points a;* := [xf,... x* N ], x* £ M d , minimizing the integrated posterior variance: 

(3.3) x* = argmin / c(x\x)dp(x). 

x£U Jx 

Here U is the space of all feasible experiments and the posterior variance is specified via 

( 2 . 2 ). 

While this objective function is similar to that in [58], here we will employ a continuous 
space IA of possible experiments. Also, we will solve the optimization problem (3.3) both in 
a non-greedy fashion (finding all N design points simultaneously) and using greedy updates 
with varying batch sizes. In this section the number of design points N and the prior 
covariance kernel K will be considered fixed. Later, in Section 5.4, we will consider closed- 
loop design procedures that alternate between batch minimization of IVAR and updates of 
the covariance kernel. 

3.1.1. Sample-average approximation of IVAR. One method of minimizing IVAR in¬ 
volves numerically evaluating the objective in (3.3) via quadrature or a quasi-Monte Carlo 
or Monte Carlo (MC) sampling procedure. Since the variance is in general a smooth func¬ 
tion of x, quadrature schemes may work efficiently for low-to-moderate dimensional input 
spaces, but Monte Carlo will generally work better in higher dimensions; Monte Carlo also 
offers more flexibility for non-tensor-product domains X. Monte Carlo sampling replaces 
the integral with the summation 

p -i N-mc 

(3.4) / c(x\x)dp(x) « —— ^ c(xi\x) =: J mc {x), 

Jx A mc “ 

where N mc is the number of Monte Carlo samples and x* ~ /x. Computing the integrated 
variance for a set of N experiments then requires an inversion of the covariance matrix, an 
0(N 3 ) operation, and variance evaluations at N mc points, an 0(N mc N 2 ) operation. 
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The sample-average approximation (SAA) [59] approach to optimization simply replaces 
the expectation in the objective (3.3) with a quadrature or Monte Carlo approximation at 
a fixed set of points and minimizes this objective. After one has chosen this set of fixed 
points, the minimization becomes a constrained deterministic minimization problem. We 
can use readily available analytical derivatives of the objective in this setting. In particular, 
given the form of the kernel K, we can directly compute the gradient N x c{x\x ) from (2.2); 
details are given in Appendix C. The gradient of the SAA objective obtained from Monte 
Carlo then becomes 

-i N me 

yj mc {x) = — Va-c(xi|x), 

JV mc i=1 

and similarly for quadrature. 

Finally, we note that in some cases it may be more convenient to work with closed- 
form expressions for the eigenfunctions of the kernel K rather than the kernel itself; such 
situations arise when one has a desired basis of approximation but the corresponding closed- 
form kernel is unknown, or if the eigenfunctions can otherwise be easily computed. In this 
case, one can rewrite the IVAR objective in terms of eigenfunctions and simply maximize 
the second term on the right of (2.6). Written in this form, the objective does not require 
integration with respect to the parameter measure /jl. Indeed, having the eigenfunctions in 
hand is tantamount to already having performed the integration, as the eigenfunctions are 
solutions of the homogeneous Fredholm integral equation with operator Tj<. [22] uses this 
approach for IVAR minimization, exploring truncations of (2.6) to t eigenfunctions. 

3.1.2. Batch and greedy implementations. In this work, we have used gradient-based 
optimization algorithms from both the NLopt [31] and SciPy [32] optimization packages, 
with similar performance, to solve the optimization problem (3.3). This problem involves N 
design points, each in d dimensions, and thus has Nd unknowns. It can become expensive to 
solve when Nd is very large. In these cases it may be useful to solve a sequence of smaller 
optimization problems to achieve an approximate solution. In the numerical examples 
below, we will investigate constructing these smaller problems through the use of a greedy 
minimization procedure. In this procedure one decides how many training points M are 
computationally feasible to minimize. Suppose that k = N/M is an integer. Then the 
greedy procedure solves k optimization problems of size Md. Once the (j — l)th problem 
is solved, for j < k, we have obtained the experiments Xj -During the jth iteration we 
find M points to append to Xj- 1 . 

3.2. Entropy and mutual information. Statistical criteria underlie many other experi¬ 
mental design procedures for Gaussian process regression [53]. Two popular techniques in¬ 
clude minimizing the conditional entropy of the Gaussian process at unobserved locations, 
or maximizing the mutual information (MI) between the locations at which experiments 
are performed and the rest of the design space. Our numerical results will compare IVAR 
designs with MI and entropy-based designs because the latter have algorithms specifically 
tailored for Gaussian process regression. For a comparison between these two methods and 
additional design procedures, e.g., based on classical alphabetic optimality criteria, we refer 
to [34]. 

The conditional entropy design procedure seeks experiments that reduce the uncertainty, 
as measured by entropy H, across a typically finite set of possible simulation locations. If the 
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set of candidate experimental locations is denoted by T> and x C T> are the chosen locations, 
then x c := T>\x are the locations at which the entropy is evaluated. In particular, one 
seeks x to minimize the conditional entropy H(F x c\F x ) = — f p(f x c , f x ) log p(f x c \ fx ) df x <df x , 
where F s is a random variable representing the outputs of the simulation model at a set 
of inputs s and p denotes a probability density. Minimizing this function is shown to be 
NP-hard in [33]. In practice, one instead employs greedy but suboptimal algorithms that 
add one experiment at a time—for example, adding each experiment at the location where 
the conditional entropy is largest [60, 9, 38]. In the context of GPR, F x and F x c are jointly 
Gaussian; this procedure then becomes equivalent to greedily adding experiments at the 
locations of highest variance, and is commonly called the MacKay criterion (ALM). Other 
algorithms for choosing a subset of points to minimize the conditional entropy are based on 
the DETMAX algorithm [41], demonstrated for GP regression in [10] 

The mutual information criterion for experimental design considers the change in the 
entropy of F x c before and after performing experiments at locations x: H(F x c) — H(F x c\F x ), 
which is equivalent to the mutual information of F x and F x c. This objective is also typically 
maximized in a greedy fashion: from the candidate set of experiments T>, the element which 
yields the greatest MI is chosen at each iteration. Unlike the conditional entropy, however, 
MI is submodular, guaranteeing that a greedy approach performs within a constant factor 
of the full V-experiment maximization [34], The greedy procedure requires the inversion 
of a matrix containing the covariance between every pair of candidate experiments, which 
arises when computing the entropy on the set of all simulation locations. If one is choosing 
N experiments from a set of M > iV candidates, each iteration then requires inverting a 
matrix of size M x M. This expense is typically large and many recent efforts have aimed 
at reducing it. [34] reduces the cost of each iteration to 0(M ) by using specialized local 
kernels. [3] points out that the quality of an MI design depends crucially on the candidate 
set, and modifies the greedy algorithm of [34] to be more robust by resampling the set of 
candidate experiments after each new point is chosen. 

Besides the choice of objective, our IVAR-based design algorithm differs from the en¬ 
tropy and MI approaches above in other fundamental ways. First, we select experiments 
from a continuous design space and thus avoid the challenges of combinatorial optimization. 
In this sense, we follow the approach of [52] by solving a continuous optimization problem 
using gradient descent algorithms. Second, as described in Section 3.1.2, our approach can 
identify multiple experiments simultaneously. Designing batch experiments is advantageous 
because interactions between the experiments are taken into account; we will demonstrate 
this advantage empirically in the next section. Our continuous design approach also takes 
several steps beyond [52]. First, we do not employ particular patterns or partitions to 
simplify the design space. And as described earlier, we introduce a sample-average approx¬ 
imation of the IVAR objective. This helps design experiments on complex input domain 
geometries by penalizing proximity to the domain boundaries; the objective automatically 
becomes large in locations where there are few samples. We will also address the “pileup” 
problem identified in [52], explaining it in terms of the design size relative to correlation 
kernel complexity. 

3.3. Computational complexity. In some sense, if evaluating / is sufficiently expensive, 
then the computational cost of finding a good design is immaterial. But the design proce¬ 
dures described above can have very different costs, and in practice one may not want the 
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computational effort required for experimental design to be too large. Table 1 summarizes 
the computational complexity of evaluating and optimizing the various experimental design 
objectives considered above. In the IVAR scenarios, N mc denotes the number of Monte 
Carlo points sufficient for (3.4), where typically N mc N. For the entropy and MI crite¬ 
ria we assume that the number of candidate designs Ne = \D\ N. Finally, we assume 
that optimizing the IVAR objective requires L objective and/or gradient evaluations, where 
typically L < N mc . 

Greedy minimization of the conditional entropy (ALM) requires N iterations. In each 
iteration the variance must be computed at Ne locations. 2 At iteration k < N, this 
variance computation requires an 0(k 3 ) inversion followed by Ne variance evaluations, 
each of complexity O (k 2 ). Thus the complexity for step k is 0{k 3 + IV^fc 2 ). The total 
complexity is thus O (j2k=i k 3 + IVe/c 2 ) = O (IV 4 + NeN 3 ). 


Design objective 

Optimization method 

Objective eval 

Optimization cost 

IVAR 

Monte Carlo, batch 
eigenfunction form, batch 

O {N mc N 2 + N 3 ) 
O {IN 2 + N 3 ) 

O {LN mc N 2 + LN 3 ) 
O (£LN 2 + LN 3 ) 

entropy 

greedy 

0{N 3 + N e N 2 ) 

0(IV 4 + n e n 3 ) 

mutual information 

greedy 

oWF) 

0{NN%) 


Table 1 


Computational complexity of different experimental design algorithms. 

The ALM approach is often the computationally cheapest option, but not if Ne > 
LN mc /N. If Ne and N mc are of comparable magnitude, then the comparison of entropy 
and IVAR minimization depends on whether L/N < 1, i.e., how the number of optimization 
iterations (in the continuous case) compares to the number of candidate design points (in 
the discrete case). We also see that the MI design becomes very expensive for large Ne) 
a small candidate design set must be chosen for the MI procedure to remain tractable. 
Using a smaller candidate set, however, increases the possibility that the chosen designs 
will perform poorly. 

4. Numerical examples of IVAR designs. We now provide numerical examples il¬ 
lustrating the quality of designs arising from continuous IVAR minimization. All of the 
examples in this section were performed using the freely available GPL-licensed GPEXP 
package [25] for python. First, we examine the interpolatory properties of IVAR design 
points. Then we compare IVAR, entropy, and Mi-based designs on several domains. We 
also include comparisons with space-filling designs obtained by the method of Lekivetz and 
Jones (LJ) [35], which is well suited for many non-rectangular regions. The LJ algorithm 
works by grouping randomly sampled points into equally sized clusters using Ward’s min- 
inurn variance criterion; one design point is then extracted from each cluster (for instance, 
by computing the cluster centroid). The LJ approach encounters difficulties for non-convex 
domains, and thus we do not use it in the non-convex test case of Section 4.2.3. 

2 Although we have described a discrete approach for ALM minimization, a continuous optimization 

approach is also possible. But the pointwise variance of a GP has many local maxima, and a multi-start 

procedure would likely be necessary for continuous optimization to be competitive with discrete enumeration. 
It is then unclear whether continuous optimization would in fact be more efficient in this case. 
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4.1. Stability of interpolation. As discussed above, GP regression with a 2 = 0 (and 
a kernel rank l > N ) corresponds to interpolation. In this setting, it is useful to analyze 
the quality of interpolation on IVAR design points. The Lebesgue constant is often used to 
bound the error of a polynomial interpolant relative to the best approximation in a polyno¬ 
mial space of equivalent degree, but it can also be used to analyze interpolation with positive 
definite kernels, corresponding here to the posterior mean of the GP. Let m[x ) denote the 
interpolant or posterior mean, and let m*{x) denote the best approximation of /, in the L 00 
sense, from the finite-dimensional kernel space H(K,xn ) = span{A'(-, x\),.. ., K(-, xjv )}- 
We restrict our attention to approximation on compact domains X C Then the L°° 
interpolation error is bounded as [17] 

11/ - m\\ Loo (x) < (1 + A a>jv )||/ - m*|| LooW , 

and moreover we have [13], 

IMIloRt) ^ ^k,x n W vhoo- 

The Lebesgue constant A k,x n for prior kernel K and design points xjy is 

N 

(4.1) A k , Xn = max^2\uj(x)\, 

X 3 =1 

where {uj(x)}jL ] are the cardinal functions satisfying Uj{xf) = 6 t j, such that the interpolant 
can be written as m(x) = J2jLi Uj u j( x ) with y 3 = f(x 3 ). We evaluate the cardinal functions 
as in [17] by solving the linear system 

(4.2) R _1 ti(x) = K(x,x), 
where u(x) = [rti(x)... uj\r(x)]. 

Now we conduct a simple numerical experiment to evaluate the Lebesgue constants 
of point sets arising from IVAR minimization. We find IVAR designs on the domain 
X = [—1,1] with squared exponential kernel K(x,x') = exp(—(x — x') 2 /21 2 ). Figure 1 
shows the associated Lebesgue constants as a function of number of design points N, for 
various correlation lengths l. Several interesting trends are observed. First, we see that for 
any value of l, the Lebesgue constant is exactly one for sufficiently small N. This observa¬ 
tion is consistent with the asymptotic estimate for l —> 0 in [50]. The Lebesgue constant 
reverts to one for small point sets because the IVAR criterion ensures that the points are 
well separated, and hence the cardinal functions do not overlap. Once a certain threshold 
value of N is attained, however, the Lebesgue constant begins to increase; this threshold 
value is smaller for larger values of the correlation length l. This transition coincides with 
interactions among the cardinal functions: Figure 2 shows cardinal functions for IV = 16 
and l = 0.1, just beyond the threshold where interaction becomes significant. In this regime, 
the Lebesgue constant increases steadily with N. For a sufficiently large N, RN 1 becomes 
poorly conditioned and direct computations of the interpolant, the cardinal functions, and 
the Lebesgue constant are no longer numerically stable. From the Bayesian perspective, 
this ill-conditioning corresponds to the “complexity” of the RKHS associated with the prior 
kernel—i.e., the effective number of nonzero eigenvalues in (2.5)— being exceeded by the 
data. 
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Figure 1. Lebesgue constants for N-point IVAR designs on [— 1 , 1 ] with a squared exponential kernel and 
various correlation lengths l. 



Figure 2. Cardinal functions for interpolation on 16 IVAR points with a squared exponential kernel and 
l = 0.1 


The success of the IVAR optimization procedure itself also depends on how N relates 
to the complexity of the kernel. In the small -N and small-/ regime (intuitively, “too few” 
points relative to the complexity of the kernel), the IVAR cost function is relatively flat as 
a function of the design coordinates x, and distinguishing the quality of different designs 
becomes difficult. By contrast, in the large-./V and large-/ regime (“too many” points relative 
to the complexity of the kernel), the IVAR value itself is exceedingly small and the problem 
is poorly conditioned, as described above. Away from these limiting regimes, however, the 
IVAR design procedure yields relatively slow growth in the Lebesgue constant and thus 
relatively stable interpolation. 

4.2. Approximation comparisons. Here we illustrate IVAR designs on variety of input 
domains X, and compare the performance of IVAR designs with that of designs obtained 
through conditional entropy minimization or MI maximization. We highlight designs and 
approximations on irregular input domains (e.g., domains that are neither hypercubes nor 
M. d ), which are critical in many real-world applications. In particular, irregular domains 
often arise as a result of domain partitioning by a discontinuity detection algorithm, for 
models whose outputs are piecewise smooth. For example, in [29, 26] the authors automat¬ 
ically partition the input domain of a genetic toggle switch model that exhibits a phase 
transition. Following this partitioning, function approximation proceeds on two separate 
but irregular domains. In [55, 56] the authors study a climate model which exhibits a 
discontinuity. Discontinuity detection is used to split the input domain into two irregular 
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subdomains, and is followed by function approximation. We will thus attempt to show the 
applicability of our algorithm to input domains that are characteristic of such problems, 
which have no guarantees of convexity or even connectedness. 

In most of our numerical experiments, we employ an isotropic squared exponential kernel 
K(x,y) = exp(— ||ar — y\\ 2 /2l 2 ) and we set a 2 = 1CF 10 ; the only exception is in Section 4.2.2, 
which uses a periodic kernel to contrast space-filling designs with designs that are adapted to 
the kernel. Entropy minimization is pursued through the ALM approach described earlier; 
each experiment is chosen by comparing the variance at Ne = 10 4 possible experimental 
locations sampled according to the parameter measure /x on X. For MI maximization, we 
select designs from Ne = 200 candidate locations generated by the space-filling LJ design. 
The size of the MI candidate set is chosen so that the computational times of the different 
experimental design procedures are comparable; it is also comparable to sizes used in the 
literature [3] . We find that small changes in the size of this candidate set do not qualitatively 
change the results reported below. 

Finally, we define the relative L 2 error of a GP approximation as \\f — m lk/ll/lk , and 

MM 

we estimate it using 10° Monte Carlo samples. While in the previous section we considered 
L°°(X ) error, recall that the IVAR objective function reflects the expectation of a squared 
/x-averaged error. Thus L 2 is a logical metric of quality. Other efforts, e.g., [37], empirically 
evaluate the L 2 and L°° errors of approximations resulting from a variety of other design 
criteria. 

4.2.1. Circular domain. We first construct designs on a circular domain in R 2 with 
radius 0.7. We fix the correlation length in the kernel K to l = 0.2; the measure /x is 
uniform over the domain. Results from each design strategy are shown in Figure 3, for 
designs with N = 8, 12, and 20 points. We consider full batch IVAR, designing all N points 
simultaneously, along with greedy IVAR strategies that add points in groups of M = 1 or 
M = 4. Full IVAR results in the most symmetric and regularly spaced points. But all of 
the IVAR strategies attempt to spread the points throughout the domain, whereas entropy 
design first places points on the boundaries, which is not a desirable feature for radial basis 
function kernels [48, 34]. The LJ designs are reasonably space-filling, as expected, but full 
IVAR yields even more consistent spacing. For timing reference, we note that finding a 40- 
point design with the full IVAR criterion took 11 seconds, IVAR-1 took 65 seconds, IVAR-4 
took 29 seconds, entropy-based design took 10 seconds, and MI design took 79 seconds. 
Thus non-greedy IVAR design and entropy design take approximately the same amount of 
time, while MI maximization is slightly more costly, though of the same order of magnitude. 

To evaluate the effectiveness of these designs for function approximation, we perform 
GP regression on 1000 functions independently sampled from the prior Gaussian process, 
with results shown in Figure 4(a). For each sampled function /, we compute the relative L 2 
error between the posterior mean m of the GP and /, as described above. We then report 
the average and standard deviation of this error, for each design strategy and different values 
of N. The IVAR designs, including the IVAR-M greedy strategies, clearly outperform the 
entropy and MI designs. Optimality of the full IVAR designs is to be expected, since we are 
essentially calculating the Bayes risk (the expectation of [[/ — m|| r 2 ) which is equivalent to 

-^M 

IVAR, as discussed in Section 3. But it is noteworthy that even the greedy IVAR strategies 
show better performance than the MI and entropy designs. The LJ designs show errors 
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IVAR IVAR1 IVAR4 Entropy LJ Ml 





Figure 3. Experimental designs on a circuar domain obtained using various strategies for an isotropic 
squared exponential kernel with l = 0.2. IVAR-M corresponds to a greedy IVAR strategy, adding points in 
batches of M. 



(a) Two-dimensional domain; prior corre- (b) Five-dimensional domain; prior corre¬ 
lation length l = 0.2 lation length l = 0.5 

Figure 4. Relative L 2 approximation error on 1000 functions drawn from a GP with squared exponential 
kernel; error bars indicate sample standard deviations of these errors. Domains X are balls in R 2 orR 5 . 


comparable to the greedy IVAR strategies, but not as low as full IVAR. In Figure 4(b), 
we repeat this study for a higher-dimensional domain—a ball in R 5 —and observe similar 
trends. For reference, the computational times required to find 40-point designs in d = 5 
dimensions are: 13 seconds for batch IVAR, 125 seconds for IVAR-1, 45 seconds for IVAR-4, 
10 seconds for entropy, and 82 seconds for MI. These results further support the idea that 
full IVAR is computationally competitive with entropy and that design with MI is more 
expensive. 
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C\] 


IVAR IVAR1 IVAR4 Entropy LJ Ml 



Figure 5. Experimental designs obtained using various strategies for a periodic kernel on the domain 

[-M] 2 . 


4.2.2. Periodic kernel. The IVAR, MI, and entropy-based design criteria explicitly 
incorporate the kernel of the GP, while the space-filling LJ design approach does not. Thus 
it is reasonable to expect kernel-adapted designs to be more successful for a broader range 
of kernel specifications. To assess this behavior, we repeat the experiments of the previous 
subsection on a square domain with a periodic kernel. In particular, we use the kernel 


K(x,y;p,l ) = exp 



sin 2 (7r|x — 

li¬ 



on the domain [—1, l] 2 , with period p = 1.0 and correlation length l = 0.9. The resulting 
design points are shown in Figure 5 and the function approximation results are shown in 
Figure 6. Now, optimal designs for the IVAR, entropy, and MI schemes are not space filling. 
They are clustered in areas that reflect the periodic structure of the covariance kernel. The 
space-filling LJ nodes do not exploit this property and, as shown in Figure 6, yield higher 
errors than the kernel-adapted nodes. 


4.2.3. Non-convex, non-simply connected domain. Figure 7 illustrates experimental 
design on a parameter domain that is neither convex nor simply connected, found via full 
(non-greedy) IVAR minimization. These designs are obtained with an isotropic squared 
exponential kernel with correlation length l = 0.1. The IVAR objective is minimized using 
the SAA approach described in Section 3.1.1. Optimization for each IV-point design began 
from a single randomly-generated point in the 21V-dimensional design space; no multi-start 
procedure was used here. The designs do a good job covering the domain. While twelve- and 
sixteen-point designs are not completely evenly distributed, designs using 32 and 60 points 
are very well spaced and have interesting symmetries. The approximation performance of 
IVAR designs—even greedy IVAR-M designs—is also superior to that of the entropy and 
MI approaches, as shown in Figure 8. 
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Figure 6. Relative L 2 approximation error on 1000 functions drawn from a GP with periodic kernel on 
[—1, l] 2 with parameters p = 1.0 and l = 0.9; error bars indicate sample standard deviations of these errors. 



- 0.5 0.0 0.5 - 0.5 0.0 0.5 


(a) 12 points (b) 16 points 



- 0.5 0.0 0.5 - 0.5 0.0 0.5 


(c) 32 points (d) 60 points 


Figure 7. Experimental designs computed by minimizing the IVAR cost function over a non-convex 
and non-simply connected domain; the domain is the red/shaded region, endowed with a uniform parameter 
measure. 



Figure 8. Performance of IVAR minimization on non-convex, non-simply connected domain with l = 0.1. 
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5. Comparisons with pseudospectral approximation. Pseudospectral approximation 
(PSA) is a well-established approach for computing polynomial approximations from point- 
wise function evaluations f(x) [63, 8, 7]. Recall that PSA comes equipped with an experi¬ 
mental design procedure and an algorithm which seeks accuracy in an L ^ sense. In Sections 
2-4, we described an experimental design procedure for GP regression that also targets an 
expected L^-type error. Thus we are now well positioned to make a comparison between GP 
regression and PSA. In particular, our comparison of these surrogate construction methods 
arises from the perspective of the three choices described in the introduction: the approxi¬ 
mation space, the experiments, and the algorithm for constructing the approximation itself. 
Our comparison comprises three components described in Sections 5.2, 5.3, and 5.4. We 
first discuss a theoretical result bounding the error between the GP posterior mean and the 
PSA under orthogonalizing designs. Then we show what happens when we relax the idea 
of exact orthogonality. Finally, we provide some numerical comparisons between the two 
methods on canonical approximation problems. 

Overall, our goal is not so much to compare performance but rather to understand the 
links and distinctions between these approaches. In particular, we would like to know: are 
GP regression and pseudospectral approximation ever equivalent? Do the point sets used 
for pseudospectral approximation yield low IVAR when used for GP regression? How do 
IVAR-minimizing designs compare with quadrature rules, for appropriate choices of kernel? 


5.1. Pseudospectral approximation. In this section, we recall the basics of pseudospec¬ 
tral approximation. Consider a set of basis functions {ipi(x)} comprising a complete or¬ 
thonormal system in . Pseudospectral approximation takes advantage of orthonormality 

= j A(x)A( x )dfJ-(x) = Sij, 

to compute the projection of a function / onto a subspace of L 2 spanned by t basis functions. 
The orthogonal projection 

i 

fe( x ) =^2(f,A)^A( x )- 

i— 1 

converges in the T 2 sense as the subspace grows: lim^oo f (fa — /) 2 d/i = 0. Pseudospectral 
approximation departs from exact orthogonal projection by using numerical quadrature to 
approximate the inner products (/, V’i)- In particular, one seeks an orthogonalizing set of 
nodes and weights, Q = {(x k ,w k ) : k = 1... N}, i.e., a quadrature rule that computes the 
inner products between the first £ basis functions exactly: 

N 

(5.1) (ipi,ip j )^ = Y w kA(x k )A( x k )^ i,j = 1, 

k =1 

This choice eliminates internal aliasing error [7], though the pseudospectral approximation 
will still exhibit external aliasing error, since / is not necessarily in the span of ..., '(/y}. 
Given an orthogonalizing rule (5.1), a pseudospectral approximation fi can be written as 


N 


N 


(5.2) fi(x) = 5Z w kf{x k )A{x k ) A(x) = 


i =1 \fc=l 


k =1 


Wkf{x k ) YA{xk)A{x) 


\i=l 
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In subsequent analysis, it will be convenient to express fe(x) in matrix notation. Let 
W := diag(u>i,. ..w N ), y := [f(x i),.. .,f(x N )] and fa := ■ ■ ■ ,^i{x N )]. Then 

i 

(5.3) f e (x) = y T W ^ ifiifi(x). 

i =1 

Examples of {V'i} and Q include one-dimensional orthogonal polynomial families and 
their corresponding Gaussian quadratures; or tensorized versions of each in multiple dimen¬ 
sions. However, the basis functions ifi do not in general need to be polynomials, and other 
quadrature rules besides Gaussian rules may exist. Using Mercer’s theorem, we can already 
see that J2i =l ' t Pi( x )' l l ; i(y) can be interpreted as a truncated kernel K e (x, y ) with eigenvalues 
that do not decay, and we can interpret the pseudospectral approximation given in (5.3) as 
a weighted combination of such kernels. 

5.2. Same approximation space and experiments. To begin our comparison, we will 
fix two of the choices involved in surrogate modeling and evaluate the impact of the third. 
In particular, this section will compare the impact of different algorithms —i.e., using pseu¬ 
dospectral approximation versus GP regression—for identical experiments and for the same 
or similar approximation spaces. Because pseudospectral approximation requires experi¬ 
ments to be chosen according to an orthogonalizing rule, we will also use these experiments 
for GP regression. We are now left to relate the basis for the pseudospectral approximation 
to the GP kernel, as these objects determine the approximation space. 

Theorem 2.1 lets us represent any Mercer kernel, and the corresponding GP posterior 
mean function, using the eigenfunctions. With this connection, we first develop a more 
general result than required. Specifically, we show that when the basis for pseudospectral 
approximation comprises a subset of the eigenfunctions of a given kernel, then the L 2 - norm 
of the difference between the resulting GP posterior mean function and the pseudospec¬ 
tral approximation may be bounded. Results for identical approximation spaces follow 
immediately as a corollary of this general case. 

For the result below, we will assume that we have a Mercer kernel consisting of Iqp 
eigenfunctions, where £qp could be finite or infinite, and a spectral expansion consisting of 
the first l < £qp eigenfunctions, where £ is finite. The approximations will be constructed 
from N evaluations of the function /, performed at nodes of a quadrature rule {(xi, Wi)} , i = 
1 ... IV, that orthogonalizes the first £ eigenfunctions. Clearly N depends on i. Let USU r 
be the eigendecomposition of the matrix of covariances K (x'j, Xj) among all the design points 
computed without a nugget, i.e., R ~ 1 -a 2 I = XJSV T . S is a diagonal matrix with elements 
S = diag(si,..., sat). Finally, assume a zero mean prior mo(x) = 0 for the GP model. The 
latter assumption does not restrict the generality of our results. In fact, one can transform 
the problem from one of approximating / to one of approximating g = f — mo, for some 
fixed function mo(x). Of course, this transformation requires one to translate the data y 
accordingly. Under these conditions, we have the following result, whose proof is given in 
Appendix A. 

Theorem 5.1. Let fefx) be a pseudospectral approximation represented with basis func¬ 
tions {ipi ,..., ipe}, computed via an orthogonalizing quadrature rule Qi as in (5.1)-(5.2). 

Let M = max \ r d> i (x)\. Also, let w mSLX = max w; . Let mix) be the poste- 

(x,w)eQe, ie{ l,ie{i,...,JV} 

rior mean of GP regression with prior covariance kernel K(x, x') = J2i=i ^i<t>i(x)<t>i{x') and 
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nugget term a 2 > 0, constructed from function evaluations at the nodes of Q{. If I, N < oo, 
ifi = <f>i for i = 1.. .1, and I < Igp, then the difference between these two approximations 
is bounded as 


\m- fi \\ 2 L 2 < ||y 


INM 2 w 2 m& - 
( s N + a 2 y 


t-GP 

E 

j=i +i 


+ a 2 1 


£GP 

+ E R2( h 

j=£+l 


We can consider a relative notion of error by dividing each side by ||y \\ 2 . Furthermore, since 
we are dealing with bounded functions / and finite amounts of data, we can always normalize 
and center the data to make ||y ||2 = 0(1). The difference between the two approximations 
described in Theorem 5.1 has two sources. First is the contribution of kernel eigenfunctions 
that are not in the pseudospectral approximation basis, i.e., the summations involving j > I 
above. Second is the contribution of the noise term a 2 . 

Because the kernel eigenfunctions span the RKHS containing the GP posterior mean, 
the case of equivalent approximation spaces for pseudospectral approximation and GP re¬ 
gression follows simply by setting Igp = I < oo. This case is highlighted by Corollary 5.2. 

Corollary 5.2. Let Igp = I • Then the difference between the pseudospectral approximation 
fe(x) and GP posterior mean m(x) defined in Theorem 5.1 is: 

(5-4) ||m - kWh < Ml ZNM 2 wl ax (—• 

e \SN + V Z J 


In this case, the difference between approximations is due only to the nugget a 2 > 0 and 
the smallest eigenvalue of the covariance matrix as indicated by sat. If we additionally have 
that N < Igp , then the design covariance matrix R 1 remains invertible (i.e., with sjy > 0) 
even as a 2 —>• 0, yielding zero difference between the approximations. This occurs, for 
example, in the case of kernels constructed from fully tensorized polynomial eigenfunctions 
and tensorized Gaussian quadrature rules, where N = I = Igp- 

We also note that the bound in Theorem 5.1 depends on the minimum eigenvalue of 
the covariance matrix, represented (after diagonalization) by S + <r 2 I. If S is nearly rank- 
deficient and the nugget is sufficiently small, then the bound can be large. This situation is 
not purely an artifact of the theory; indeed, it corresponds to a poorly conditioned numerical 
problem, where the actual difference between the two approximations may be large as well. 
One may imagine a case where R is not invertible, e.g., too many quadrature nodes are 
used and the GP kernel is finite rank; in this case the computation of m becomes unstable 
whereas the computation of fi can still proceed in a stable manner. 

Besides bounding the difference between approximations, we can also analyze the impact 
of an orthogonalizing experimental design on the integrated variance of the GP posterior. 
Consider, again, a kernel consisting of Igp eigenfunctions and a training set corresponding 
to the nodes of a quadrature rule Q that orthogonalizes the first I eigenfunctions. We begin 
by splitting (2.6) into summations involving the first I eigenfunctions and the remaining 
I + 1 to Igp eigenfunctions: 




c(x)dn(x) = E (l - R</>,:) + Xi (l - Xi<f)fK<f). 


i— 1 


z=£+l 
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The second term in this expansion represents the contribution of the extra eigenfunctions 
to the integrated variance. We cannot comment on how our training points will affect this 
term because they are designed to have special properties only for the first £ eigenfunctions. 
But we can use these properties to analyze the first term above, thus describing the impact 
of an orthogonalizing rule on the associated integrated variance. Note that the weights 
W do not explicitly enter the GP regression; nonetheless, as we show in Appendix B, this 
portion of the integrated variance can be rewritten as: 

l / . I ( *GP \ 

(5.5) £ Ai (i - A^fW £ Xrfrfj + a 2 I R<^. 

i= 1 i= 1 \j=£+l J 

This expression consists of interactions between the first t eigenfunctions and the last eigen¬ 
functions; it also includes the impact of the nugget. The eigenfunction interactions are ex¬ 
pected because the design only orthogonalizes the first £ eigenfunctions; errors are incurred 
when the numerical inner product is taken between one of the first £ eigenfunctions and 
one of the remaining eigenfunctions. We see that if £ = £qp, the integrated variance is of 
the order of the noise a 2 as expected. When additionally a 2 = 0, the integrated variance 
is zero. An orthogonalizing design ensures that the integrated variance captures only the 
contributions of eigenfunctions which are not orthogonalized. 

The preceding results highlight some of the assumptions underlying the practical ap¬ 
plication of these two surrogate construction methodologies. When using a pseudospectral 
approximation, one implicitly assumes that the true function’s projection onto basis func¬ 
tions not included in the expansion (and hence not orthogonalized by the underlying design) 
is small. Otherwise, more points and basis functions should be added to the approximation; 
indeed, adaptive basis selection is the concern of a vast array of approximation methods, 
pseudospectral and otherwise. In GP regression, a properly chosen kernel is one whose 
eigenvalue decay matches the decay of the spectral coefficients of the true function. In 
this case the function will lie in the RKHS associated with the kernel and an accurate 
approximation can readily be achieved. These two approaches interact through the inte¬ 
grated variance. The orthogonalizing rule in a pseudospectral approximation ensures that 
the difference between / and fe lies mostly in the span of the eigenfunctions <f>k>t■ Corre¬ 
spondingly, this rule forces the IVAR, a measure of the uncertainty and error (via the Bayes 
risk) of the GP approximation, to retain contributions only from the same subspace (that 
spanned by {<fk>e})- We also see that the difference between the GP posterior mean and 
pseudospectral approximation is dominated by these extra eigenfunctions. 

5.3. Relaxing exact orthogonality. It is instructive to consider the tradeoff between 
exactly orthogonalizing fewer basis functions or, for the same design points, generating an 
approximation using more basis functions than can be orthogonalized. The latter corre¬ 
sponds to performing GP regression under the conditions of Theorem 5.1 with £qp > £. 

Consider the function f(x) = sin(7r;c + 0.2) for x € M with standard Gaussian weight 
fi. We will use a Gauss-Hermite quadrature rule with N = 20 points to construct a pseu¬ 
dospectral approximation of / with the first £ = 20 Hermite polynomials (degrees 0 to 19) 
as basis functions. We compare this approximation to Gaussian process regression on the 
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same points using the Mehler kernel [57] 

1 


(5.6) 


K(x, y\ t) = 


VT^ 


exp 


1 (x) 2 t 2 — 2 txy + ( y ) 2 t 2 

2 1 -t 2 


which is a closed-form expression for K(x,y) = t l 4>i{x)4>i{y) , where {^} are again 
normalized Hermite polynomials. The nugget a 2 = 0 and the decay constant t = 0.8. In 
this way we compare two surrogates using identical experiments and closely related approx¬ 
imation spaces: the approximation space of the former is contained in that of the latter. 
But the two surrogates use different approaches for combining the same function evaluations 
into an approximation. GP regression approximates / using an infinite collection of polyno¬ 
mial eigenfunctions, with more emphasis on those associated with higher eigenvalues, while 
pseudospectral approximation projects / onto finite polynomial basis, with the exactness 
of the projection limited by external aliasing. 

Figure 9 shows the results of this experiment. First, we note that the relative L 2 
errors are 8.7 x 10~ 3 for pseudospectral approximation and 1.8 x 10~ 3 for the GP mean 
function. Perhaps more interesting than this improvement, however, is the spectrum of the 
error associated with each approximation. The left panel of Figure 9 shows the projections 
of the errors e ps (x) = fe(x) — f(x) and e gp (x) = m(x) — f(x) onto each basis function 
4>i, i.e., |(e,^>j)|. For reference, the right panel shows how much energy / has in each basis 
direction, via the magnitudes of the exact projections | (/, cj>i) \. (All projections are computed 
with extremely high-order quadrature.) We see that the projection of the pseudospectral 
approximation error e ps onto the first few basis functions is small, even though / itself has 
significant energy in these directions. The projection of e ps then rises with the basis index 
and peaks around an index of 20; it then begins to decay, just as the exact coefficients 
(/, 4>i) themselves decay in magnitude. The projection of the GP error, on the other hand, 
is flatter across the indices. It rises slightly for the first few basis functions and then begins 
to decay somewhat more slowly. The basis functions for which the orange line (GP error) 
lies below the red line (PSA error) correspond to relatively large coefficient values; this 
results in a lower L 2 error overall. 

As a preview of the next section, we include a third line in Figure 9 representing GP 
regression performed with a 20-point IVAR-minimizing design. The relative L 2 error of 
this approximation is even smaller: 1.0 x 10~ 5 . And the spectrum of its error, shown 
with the grey line in the left panel, is even flatter than that of the quadrature-based GP 
approximation. This result amplifies the previous trend: compared to pseudospectral ap¬ 
proximation, GP regression spreads its error more evenly over the spectrum. Departing 
from an orthogonalizing design allows this error to be spread even more broadly. 

5.4. Similar approximation spaces, optimal experiments. Our next goal is to compare 
GP regression and pseudospectral approximation when the approximation spaces are sim¬ 
ilar, but the experimental designs (and the algorithms) differ. We have already seen that 
RKHS containing the GP mean can coincide with the range of the pseudospectral approx¬ 
imation operator in the case of a finite-rank GP kernel: the eigenfunctions of the kernel 
can simply be used as the basis for the pseudospectral approximation. In the numerical 
comparisons below, however, we would like to allow the GP kernel to have infinite rank. 
This choice is more representative of how GP regression is used in practice, and in principle 






22 


A. GORODETSKY AND Y. M. MARZOUK 



Basis number 


Basis number 


Figure 9. Spectrum of approximation errors for the example function discussed in Section 5.3. Left figure 
shows the projection of the approximation error onto normalized Hermite polynomial basis functions <j>i, for 
three different surrogates: pseudo spectral approximation with 20 Gauss quadrature points, GP regression 
with a Mehler kernel on the same Gauss quadrature points, and GP regression with a Mehler kernel on 
IVAR-optimal points. Right figure shows the magnitude of the exact projection of f onto each basis function 
4*1 • 


may allow the GP mean to better approximate a wider variety of functions. Of course, the 
eigenvalues of the kernel need to decay, so that we have a bounded kernel. 

We will therefore use the Mehler kernel (5.6) as in Section 5.3, and consider target func¬ 
tions whose inputs are endowed with standard Gaussian weight on M d . The eigenfunctions 
of the Mehler kernel are Hermite polynomials, which we use as basis functions for pseu- 
dospectral approximation. In d > 1 dimensions, we use a tensorized version of the Mehler 
kernel: K(x,y;ti,... ,tg) = Ilf=i where fj now governs the decay rate of the 

eigenvalues associated with the univariate eigenfunctions in dimension i. 

To design experiments for GP regression, we use IVAR minimization in combination 
with adaptation of the covariance kernel hyperparameters (e.g., correlation lengths) and 
the nugget a 2 . In particular, we choose these parameters by maximizing the log-marginal 
likelihood, 

(5.7) logp(y\x, 9) = ~y T Ry - ^loglRT 1 ] - ^log27r, 

with respect to the complete set of hyperparameters (including cr 2 ), denoted by 6. Our 
approach interleaves kernel adaptation with batch IVAR design. The batch size for each 
design is described in each example, but the number of Monte Carlo points used to evaluate 
the IVAR objective is fixed at N mc = 10 4 . 

For pseudospectral approximation, we use a state-of-the-art adaptive Smolyak algo¬ 
rithm [7], which adaptively enriches both the approximation basis and the experimental de¬ 
sign using a greedy heuristic. The approximation is essentially a Smolyak sum of full tensor- 
product polynomial approximations, each computed using a pseudospectral approach. We 
use this approach because a regular tensor-product approach becomes infeasible for more 
than a few dimensions; hence many sparse grid [2] and dimension-adaptive [24] polynomial 
approximation algorithms have been developed. The Smolyak algorithm in [7] uses gen- 
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eralized sparse grids and dimension adaptivity and thus provides a useful benchmark for 
comparison with kernel-adaptive GP. It is implemented in the MIT Uncertainty Quantifi¬ 
cation framework (MUQ) [47]. Since the input domain is endowed with Gaussian measure, 
the sparse grid design is based on one-dimensional Gauss-Hermite quadrature rules, with 
the number of points growing exponentially with the level of the sparse grid. 

5.4.1. Additively separable functions. The first problems we consider are two dimen¬ 
sional and additively separable. Additive separability is a property favorable to the Smolyak 
algorithm, in part because of the greedy heuristic the algorithm uses to enrich the polyno¬ 
mial basis. The target functions are: 

(5.8) fi(x) = x (1) + (x (2) ) 2 , 

(5.9) /2(x) = sin(27ra;W) + (x^) 2 . 


The batch size for the closed-loop IVAR scheme is one training point. Adaptation is per¬ 
formed for the eigenvalue decay rates t\ and Q in each dimension and for the noise term 
a 2 . Figures 10 and 11 show the resulting relative errors and hyperparameter traces, as a 
function of the number of training points. As in Section 4, the relative errors are defined 
as 11/ — /dU 2 /ll/llz, 2 and ||/ — m(x)\\ L 2 /\\f \\ L 2 for Smolyak pseudospectral approximation 
and GP/IVAR, respectively. These errors are computed using 10000 Monte Carlo points. 
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Figure 10. Error comparisons between GP/IVAR and adaptive Smolyak approximations for target func¬ 
tion (5.8). Right panel shows hyperparameter traces for the GP design procedure. 

In these examples, we observe that the adaptive Smolyak algorithm requires several 
function evaluations until it begins to converge. Once it converges, however, the relative 
error drops to machine precision. (Recall that the quadrature rules used in the Smolyak 
scheme require adding several points at a time.) The GP approximation error, on the other 
hand, decreases steadily but gradually as additional points are added and as the kernel is 
refined. In both cases, by the time the pseudospectral approximation becomes accurate, the 
GP approximation has already achieved at least 10 -4 relative error, perhaps sufficient for 
many applications. The hyperparameter traces show how the GP covariance kernel adapts 
to the function being approximated. Function (5.8) is fairly low order in both dimensions, 
and the hyperparameter values t\ and t -2 both converge to fairly low numbers, indicating 
fast eigenvalue decay. The converged value of t\ is small because only the constant and 
linear eigenfunctions (in the first dimension) are updated by the data; t -2 converges to a 
slightly higher value, thus slowing the eigenvalue decay, to account for the quadratic term 
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(a) Relative error comparisons 
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Figure 11. Error comparisons between GP/IVAR and adaptive Smolyak approximations for target func¬ 
tion (5.9). Right panel shows hyperparameter traces for the GP design procedure. 
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Figure 12. Error comparisons between GP/IVAR and adaptive Smolyak approximations for the Ishigami 
function (5.10). 


in x( 2 \ Function /2 (5.9), on the other hand, is relatively high order in the first dimension 
but low order in the second. We thus observe that t\ converges to 0.9, corresponding to 
a slower decay, and that £2 converges to roughly 0.1; the hyperparameter optimization has 
“learned” that only the first few eigenfunctions in this dimension matter. 

5.4.2. Non-additively separable function. Next we perform the same experiments on 
the three-dimensional Ishigami function 

(5.10) f(x) = sin(x^^) + 7sin 2 (x ( ' 2 ' ) ) + 0.05 sin(a; (1) ), 

The Ishigami function is not additively separable, and we expect to see even better relative 
performance of GP regression in this example since the kernel eigenfunctions include the 
tensor products of all univariate Hermite polynomials. Here we use a batch size of M = 10 
experiments for IVAR design and hyperparameter adaptivity. The results are shown in 
Figure 12. 

We see that the GP begins to outperform the pseudospectral approximation after 
roughly 20 function evaluations, with an error that decreases consistently. The error from 
the adaptive Smolyak approach, on the other hand, plateaus in several regions. After 50 
iterations, the GP approximation reaches a relative error of 10~ 2 , while it takes pseudospec¬ 
tral approximation 100 iterations to reach the same relative error. This behavior may be 
attributed to the fact that the Mehler kernel accounts for interactions among the input 
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Number of training points 


Figure 13. Error comparisons between GP/IVAR and adaptive Smolyak approximations for the ten¬ 
dimensional Genz function (5.11). 


variables immediately, whereas the dimension-adaptive Smolyak approach requires some 
exploration (corresponding to the error plateaus) to find the basis functions that capture 
these interactions. 

5.4.3. Ten-dimensional Genz function. Finally, we consider a higher-dimensional ex¬ 
ample using the oscillatory Genz function [23]: 


(5.11) 


fi{x) = cos 2itwi + 


10 

i =1 


where w\ = 0.3 and the c* are chosen randomly from a uniform distribution on [0,1] and 
then normalized to ||c||i = 2.25. This Genz function is typically evaluated on a hypercube 
domain [— 1,1] 10 with normalization ||c||i = 9. We have found the present approxima¬ 
tion problem, with an unbounded and Gaussian-weighted domain, to be significantly more 
challenging, however, due to the tail behavior of the Hermite polynomials. A comparison 
of approximation errors, again between GP/IVAR using the Mehler kernel and adaptive 
Smolyak pseudospectral approximation, is shown in Figure 13. The GP approximation 
performs extremely well. We note that this Genz function involves non-additive coupling 
among all ten input dimensions, a feature that may amplify the benefits of including a fully 
tensorized set of eigenfunctions via the GP kernel. For the GP regression calculations, we 
initially added experiments in batches of M = 50, learning the hyperparameters after each 
batch, until obtaining 700 experiments total. Then we added experiments 200 at a time. In 
the future, it may be useful to automatically stop adapting the hyperparameters once the 
changes in the hyperparameters between iterations fall below some threshold. 

Note that we were able to compare GP and pseudospectral approximation in these 
examples because we had access to the Mehler kernel, and hence explicit evaluations of 
the Hermite basis functions were not required for GP regression. In practice, we may not 
have closed-form kernels for other basis function families, typically used in pseudospectral 
approximation. Truncated kernels whose eigenfunctions are polynomials, however, can often 
be represented using the Christoffel-Darboux formula [5, 11]. For normalized basis functions, 
the Christoffel-Darboux formula is essentially a finite rank kernel with equal eigenvalues, 
and thus for a fixed basis order, one would not be able to adapt this kernel. 

6. Conclusion. This paper has examined experimental design criteria and optimiza¬ 
tion procedures used to develop surrogates of computational models, highlighting aspects 
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of the interplay between experimental designs and approximation algorithms. In the first 
part of the paper, we discussed experimental design criteria for Gaussian process regression 
and presented an algorithm for minimizing the posterior integrated variance (IVAR) of a 
Gaussian process over a continuous design space. Our approach is adapted to the ker¬ 
nel structure; for instance, with isotropic squared exponential kernels, it yields well-spaced 
points on arbitrary complex domains, avoiding boundary clustering and other undesirable 
artifacts. IVAR points also have good interpolation stability, as measured by the Lebesgue 
constant for kernel interpolation. The underlying optimization problem is tractably solved 
with gradient descent methods, as long as the number of design points is not too small or 
too large relative to the complexity of the prior covariance kernel. Our numerical experi¬ 
ments also demonstrate that simultaneously designing multiple points can yield substantial 
benefits over greedy strategies. Nonetheless, even greedy minimization of IVAR yields bet¬ 
ter approximation performance than standard greedy algorithms for minimizing entropy or 
maximizing MI. 

In the second part of the paper, we compare GP regression to polynomial approxi¬ 
mation. For simplicity, we consider only pseudospectral approximation, using nodes and 
weights that orthogonalize the finite set of basis functions used for approximation. In this 
setting, when GP regression uses the same nodes and a kernel with polynomial eigenfunc¬ 
tions, the difference between the two approximations is due only to the GP “nugget” and 
truncation of the kernel. When instead coupled with an IVAR design, GP regression for 
an infinite-rank kernel sacrifices numerical orthogonality of any given set of eigenfunctions; 
compared to the pseudospectral approach, projection errors may be larger for the first few 
eigenfunctions but are more nearly constant over the eigenspectrum. This observation is 
reminiscent of the average-case quadrature of [40], which compares a Bayesian method for 
deriving quadrature nodes with Gauss rules of fixed degree. We follow these comparisons 
with an empirical study of approximation performance on various test functions, comparing 
adaptive variants of GP regression and sparse pseudospectral approximation. We observe 
that while additively separable functions lend themselves easily to sparse polynomial ap¬ 
proximations, more strongly coupled functions can be more efficiently approximated using 
the GP approach. Kernel approximation is also well suited to complex (e.g., non-tensorized) 
input domains. In our current design approach, while eigenfunctions of the kernel integral 
operator on the domain need not be explicitly computed, eigenfunctions that couple the 
input dimensions are all implicitly present. 

Future work can extend our analysis of experimental design for GP regression in many 
ways. It is useful to compare the present design criteria with other methods for choosing 
“good” interpolation nodes in the radial basis function literature [12, 14]. Comparing IVAR- 
optimal nodes to optimal nodes for “Bayesian quadrature” [40, 46] would also be of great 
interest. More broadly, we advocate for closer connections between the numerical analysis 
literature and the statistics literature on these topics. Finally, adaptive designs that inter¬ 
leave point selection with updates to the kernel would benefit from more rigorous study. 
Current sequential approaches, including the ones presented here, are perhaps reasonable 
heuristics. But look-ahead strategies that anticipate information gain from future designs, 
and that balance information for kernel adaptation with reduction of the conditional pos¬ 
terior variance, may lead to even more effective approximation procedures. 
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Appendix A. Proof of Theorem 5.1. First we recall some notation. The prior precision 
matrix is R = A + ct 2 1) . We define U and S to be the matrices associated 

with the eigenvalue decomposition R = U (S + <r 2 I) 1 U r . 

We assume, without loss of generality, that the prior mean is zero; from (2.1) we obtain 


£gp £ £gp 

m(x) = y T R^2 Xj&jfijix) = y 1 R ^ XjCpjcp^x) + y 1 R Y X j cf) j (j) j (x) 
j =i i = 1 j =£+1 

l 

= y T R^2,Xj(t)j(l)j(x) + a(x), 

3=1 

where we have replaced K(x,x) with its eigenfunctions and then separated the posterior 
mean into two terms, the first containing the first £ eigenfunctions and the second a(x ) 
denoting the rest. 

We now use the orthogonality rule (5.1) to obtain m(x ) = y T R J2j =l Xj<f>j<f>J'W<f)j<f>j(x)+ 
a(x). Recall that the pseudospectral approximation fi is given by fi(x) = y 7 W Yj=i 
such that the difference between the GP and the spectral expansion becomes: 


m(x) - fe(x) = Y \y 7 ( RXj&j&j - i) W(f> j (j) j (x) + a(x). 

3=1 

Now define d T := y 7 ^R Y^j =l — l) W for convenience, and form the L 2 difference 


J (m- dy = J ^55 d T (pjcp^x) + a(z)j d/d 


(A.l) 


= \J2 dJ 55 dT <j> 3 <t>j(x )) + (55 a ( x )) + a ( x )) 

\j=1 3=1 / \j=1 / 

= 4>j(t>j{: x )^d T 4>j(t)j{x) \ +(a(x),a( x)) 

\j=1 3=1 ' 


where the the cross terms resulting from the expansion of the square disappear due to 
orthogonality between the first t basis functions and the k > £ basis functions. 

We now seek to bound the size of the first term in (A.l). Use the orthogonality properties 
of (pj to obtain 



= 51 = Y [[d T <t>i ) 

i,j=l z=l 
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We can further simplify this expression by taking advantage of the invertibility of R and 
rewriting d T as 


£ {<f<h ) 2 = £ ( v t R (£ Wj - w j w* 


i —1 


2= 1 
t 


£gp 


= £ ( ( £ WitJ - I w* 

\i =(+1 


2—1 


< y T y E 


2=1 


R ( E AJ0J0J - (7 2 I ) 
,M+i 


where the first equality is obtained by extracting R, the second equality results from sub¬ 
stituting in the definition of R -1 , and the bound is due to the Cauchy-Schwarz inequality. 
Now we can factorize the last term above because all induced norms are sub-multiplicative, 

/ (gp \ 2 

E(^T<^IIRII 2 E 


2=1 


2=1 


(GP 

E ] wd, 

Vj=£+1 



(S + <r 2 l) _ ‘ 

2 

(gp 

< y T y 


E A /d>,d>J 1 ^T 2 I 




i=£+l 


E n w & 


< y T y 


(SN + O' 2 ) 2 


(GP 

E + 

j=e +1 


, i=l 


J>z 2 <t>Z 2 


where for the third equality we use the fact that W is a diagonal positive definite matrix; 


is 


iCmax is the square of the largest-magnitude weight, u> 2 iax = arg maxjgjq jA r} w 2 \ and Z 2 
the index of the discrete basis function with maximum norm, Z 2 = argmaxjg ^...n (fif (fii- 
Let M = max \d>j(x)\, since the first t eigenfunctions are bounded at the N 

quadrature nodes. Then we have <£% (f) Z2 < M 2 N and 


£ (-1 T <hf < / /E? 

i=i \s N + °) 


(gp 

E A A<E + ^ 

j =(+1 


We now turn to the (a(x), a(x)) term. This term is simplified similarly to the first, using 
Cauchy-Schwarz and the orthogonality of the basis functions: 

(■GP 

{a(x), a(x)) < y 1 y E A jcJ RR<pj. 

j=i+i 


Thus the overall bound can be given as: 


\m- f (\? L 2 < y T y 


I (SAT + CT 2 ) 2 


(■GP 

E + rT '' 1 

j=(+i 


top 

+ y xjtj R2 <ft 

2 j =(+1 
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Appendix B. Derivation of Equation (5.5). 

We begin by splitting (2.6) into two components in order to focus on the first term: 


(B.l) 



We again use the trick of replacing one with W0, to obtain 


t 

4>i) 

1=1 


E 

i =1 
i 

E 


Ai (<pJW(j>i - A,0fR0 l ) 
Xufi (W-AiR)^ 


i= 1 


^GP 


E A^n w E Wifi + - AjI R& 


i=l 

£ 


VJ=! 




y w? Aj+w y + o' 2 w - Aj 


i=1 
£ 


j=«+l 


fcp 


^ Xj4>j4>J + cr 2 I ] R0i, 


i=l 


R</>i 


where the first equality comes from orthogonality, the second equality results from extract¬ 
ing the basis vectors, the third equality results from extracting R on the right, and the 
fourth equality comes from splitting the sum over all Iqp terms into two sums: from 1 to i 
and from t + 1 to Igp- 


Appendix C. Gradient-based IVAR minimization. The gradient of the IVAR objective 
(3.4) with respect to the design variables can be obtained analytically from the underlying 
covariance kernel; this enables a wide variety of optimization approaches to be used for 
IVAR minimization. For simplicity, assume that we are designing experiments on a one¬ 
dimensional input domain and that our experimental design is x = [aq,... xjv]- Now the 
gradient for a given X{ (see (3.4)) may be computed as 


(C.l) 


V x c(xi\x) = -V x (^K(x,Xi) T R.K(x,x i ) S j 


N N 

= -V* ( y y K(xj,Xi)K[j,k]K(xk,‘. 

\j=i k=i 
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Let us focus on a single element of the gradient: 


dc(xi\x) 

dxe 


N N 


d 


EE — (K(xj,Xi)K\j, k\K(x k ,Xi )) 

j= 1 k =1 " 

N N 


=-ee( ( Xj,Xi )) R[j, - E E ^ ( x p x i) w— (R[j, A:]iL(x fe , x*)) 

j=ifc=i ' j=i fc=i 

/ \ ^ ^ ^ 

= 2 ( K(x £ ,Xi)) ^2 R[£, fc]A'(x fc ,Xi) - EE K ( Xj,Xi ) (R[j, A:]) K(x k ,xi) 

K ° X£ J k=i j=\ k=\ 0X1 


N N 


d 


where the second equality follows from the chain rule and the third equality follows from 
the symmetry of K and R as well as another application of the chain rule. We are left 
with terms involving derivatives of K(x,x') and of the elements of R. To evaluate the 
latter, use the identity = — R f ^ R and recall that R -1 [z,j] = K(xi,Xj) + 8ija 2 . 
Hence this quantity can also be computed from the derivative of K. For instance, if K 
is a squared exponential kernel K(xi,Xj) = exp (— (xi — Xj) 2 /2l 2 ) then the derivative is 
dK(xi,Xj)/dxi = — h(xi — Xj)K(xi,Xj). An analogous derivation can be performed for a 
multi-dimensional input space. 
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